Team: Brian Gaither, Aurian Ghaemmaghami, Drew Larsen and Sabrina Purvis
The selected dataset documents all reported crimes in the 5 boroughs of New York City, up to but excluding murder. We hope to gather insights related to location, probability, and crime type classification. With these findings, we believe we could make recommendations on police training, shift allocations, and jail/prison capacity.
As you will see below and in further lab submissions, we believe we can find meaningful clusters in time of day, day of week, day of month, and possibly even season in relation to crime counts and locations. Additionally, we plan to classify crime types and utilize that variable.
We are exploring two specific aims with this dataset:
Our definition of a successful model is one that is able to provide actionable insights and intel in addition to prediction. Being able to cluster and classify types of crimes will allow practical application around policing, whereas prediction of crime probability provides suggestions around areas to monitor.
This data has been gathered from https://www.kaggle.com/adamschroeder/crimes-new-york-city.
The dataset has 1,048,575 observations with 24 variables.
A second file accompanies the dataset, providing population metrics for each of the 5 boroughs by decade from 1950, with projections through 2040.
| Variable Label | Variable Type | Data Description |
|---|---|---|
| CMPLNT_NUM | Continuous | Randomly generated persistent ID for each complaint |
| CMPLNT_FR_DT | Datestamp | Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |
| CMPLNT_FR_TM | Timestamp | Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |
| CMPLNT_TO_DT | Datestamp | Ending date of occurrence for the reported event, if exact time of occurrence is unknown |
| CMPLNT_TO_TM | Timestamp | Ending time of occurrence for the reported event, if exact time of occurrence is unknown |
| RPT_DT | Datestamp | Date event was reported to police |
| KY_CD | Categorical | Three digit offense classification code |
| OFNS_DESC | Categorical (string) | Description of offense corresponding with key code |
| PD_CD | Categorical | Three digit internal classification code (more granular than Key Code) |
| PD_DESC | Categorical (string) | Description of internal classification corresponding with PD code (more granular than Offense Description) |
| CRM_ATPT_CPTD_CD | Categorical | Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |
| LAW_CAT_CD | Categorical | Level of offense: felony, misdemeanor, violation |
| JURIS_DESC | Categorical | Jurisdiction responsible for incident. Either internal, like Police, Transit, and Housing; or external, like Correction, Port Authority, etc. |
| BORO_NM | Categorical | The name of the borough in which the incident occurred |
| ADDR_PCT_CD | Categorical | The precinct in which the incident occurred |
| LOC_OF_OCCUR_DESC | Categorical | Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |
| PREM_TYP_DESC | Categorical | Specific description of premises; grocery store, residence, street, etc. |
| PARKS_NM | Categorical | Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |
| HADEVELOPT | Categorical | Name of NYCHA housing development of occurrence, if applicable |
| X_COORD_CD | Continuous | X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |
| Y_COORD_CD | Continuous | Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |
| Latitude | Continuous | Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |
| Longitude | Continuous | Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |
We knew from the initial dataset that we would require exogenous variables, preferably continuous ones. We selected the following items to include as potential inputs.
Weather data was captured from https://www.ncdc.noaa.gov/. We pulled all weather data from 2010 to 2020 and set the location as Central Park Station.
While the information available was plentiful, the variables we selected and captured are as described here:
| Variable code | Description | Unit of Measure |
|---|---|---|
| PRCP | Precipitation | Inches |
| SNOW | Snowfall | Inches |
| TMAX | Maximum temperature | Fahrenheit |
| TMIN | Minimum temperature | Fahrenheit |
The income data was gathered from https://www.baruch.cuny.edu/nycdata/income-taxes/per_cap.htm. From here, we captured the following information.
| Variable | Description | Unit Of Measure |
|---|---|---|
| Borough Name | Per Capita Income (Per Year) | USD |
Data on average housing prices was gathered from https://data.cityofnewyork.us/City-Government/DOF-Summary-of-Neighborhood-Sales-by-Neighborhood-/5ebm-myj7. For this set, we gathered data for 2014 & 2015 and established low, average, medium, and high sale price values per borough. We then classified the home sale prices into ranges of low, medium, high, and very high.
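The classification step could be sketched with pandas quantile binning; the borough names below are real, but the price figures are illustrative placeholders, not the actual DOF values:

```python
import pandas as pd

# Hypothetical average sale price per borough (placeholder values, not the real DOF figures)
avg_price = pd.Series({
    "BRONX": 350_000,
    "BROOKLYN": 700_000,
    "MANHATTAN": 1_600_000,
    "QUEENS": 550_000,
    "STATEN ISLAND": 450_000,
})

# Quartile-based bins assign each borough one of the four labels
price_range = pd.qcut(avg_price, q=4, labels=["low", "medium", "high", "very high"])
```

With the real 2014-2015 prices, a call along these lines could produce labels analogous to the Hm_Sls_Price_Range variable used later in the analysis.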
The population data was provided from Kaggle as an input to the overall dataset. It was provided by borough, based on census data by decade. We will use the 2010 figures for our analysis.
| Variable | Description | Unit Of Measure |
|---|---|---|
| Borough Name | Population | Count (Continuous) |
Holiday and event data was gathered through various search engine queries and compiled together. It consists of standard US holidays and events, such as parades, that occurred in 2014 and 2015.
The data quality of the original raw dataset has been analyzed with key findings outlined below.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import folium
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import json
from folium import plugins
import altair as alt
import matplotlib.dates as mdates
from datetime import timedelta
from statsmodels.nonparametric.smoothers_lowess import lowess
from matplotlib.dates import DayLocator, MonthLocator, DateFormatter, drange
#load csv
drew_file_path = '/home/drew/School/Semester4/ML1/NewYorkCityCrimes2015/Data/NYPD_Rollup_Joined_All.csv'
file_path = "C:\\Users\\BGaither\\OneDrive - McAfee\\Other\\school\\SMU\\Courses\\Fall 2020\\Machine Learning I\\Lab1\\Data_NYPD\\Modified_Data_Sets\\FE1_NYPD_Complaint_Data_Historic.csv"
f_path_old = "C:\\Users\\BGaither\\OneDrive - McAfee\\Other\\school\\SMU\\Courses\\Fall 2020\\Machine Learning I\\Lab1\\Data_NYPD\\NYPD_Complaint_Data_Historic.csv"
#load Data into Pandas
df = pd.read_csv(file_path)
#original dataset
df_old = pd.read_csv(f_path_old)
#load csv
drew_filePath = '/home/drew/School/Semester4/ML1/NewYorkCityCrimes2015/Data/NYPD_Rollup_Joined_All.csv'
filePath = "C:\\Users\\BGaither\\OneDrive - McAfee\\Other\\school\\SMU\\Courses\\Fall 2020\\Machine Learning I\\Lab1\\Data_NYPD\\Modified_Data_Sets\\NYPD_Rollup_Joined_All.csv"
#load Data into Pandas
df_Crime = pd.read_csv(filePath)
Looking at the original raw data, there are several NaNs in the dataframe. Given that most of the data is categorical, we will not impute these values but rather be selective about the features we include and drop the NaNs.
We have considered using the mean duration between CMPLNT_FR_TM and CMPLNT_TO_TM to impute the missing CMPLNT_TO_DT and CMPLNT_TO_TM, should we determine that doing so will be of value.
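A minimal sketch of that imputation idea, using hypothetical rows where `start`/`end` stand in for the combined CMPLNT_FR and CMPLNT_TO timestamps:

```python
import pandas as pd

# Toy frame with one missing end timestamp
df_demo = pd.DataFrame({
    "start": pd.to_datetime(["2015-01-01 10:00", "2015-01-02 22:00", "2015-01-03 09:00"]),
    "end":   pd.to_datetime(["2015-01-01 11:00", None, "2015-01-03 12:00"]),
})

# Mean duration is computed over the rows where both timestamps exist
mean_duration = (df_demo["end"] - df_demo["start"]).mean()

# Impute the missing end as start + mean duration
df_demo["end"] = df_demo["end"].fillna(df_demo["start"] + mean_duration)
```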
The number of missing values by column is shown below:
len(df) - df.count()
We identified 7 records with the incorrect year of 1015 entered instead of 2015. In this case, we have converted such values to the correct year of 2015.
len(df_old[(df_old['CMPLNT_FR_DT'].notnull())&(df_old['CMPLNT_FR_DT'].str.contains("1015"))])
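The correction itself can be done with an anchored string replacement; a sketch on hypothetical values (anchoring the regex on the trailing year ensures a month or day field is never touched):

```python
import pandas as pd

# Hypothetical sample of CMPLNT_FR_DT values exhibiting the typo
dates = pd.Series(["12/31/1015", "06/15/2015", "01/02/1015"])

# Replace only a trailing year of 1015 with 2015
fixed = dates.str.replace(r"/1015$", "/2015", regex=True)
```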
We've determined there are no duplicate records in the raw dataset. This was checked with the code below.
df_Dupes = df[df.duplicated()]
print(df_Dupes)
Our crime data is made up of timestamp, categorical, and geospatial data, with no continuous variables. As a result, we have no outliers in the raw form of the data.
Below we can see various details for each feature, such as non-null value counts, unique counts, top (the most commonly occurring value), and frequency (the count of occurrences of that value).
df.describe(exclude=[np.number])
With the bar chart below, we can see that 63% of the data is made up of 6 types of crimes: Petit Larceny, Harrassment 2, Assault 3 & Related Offenses, Criminal Mischief & Related Offenses, Grand Larceny, and Dangerous Drugs. There is a long tail of other crimes whose volume drops off significantly after the top 6. We chose a bar chart to show comparative differences between crime type volumes.
import altair as alt
totalRecords = len(df)  # 1,048,575 records in the raw file
df_crime_prop = df.groupby(['OFNS_DESC']).agg(count_cmplnt=('CMPLNT_NUM','count')).reset_index()
df_crime_prop['percent_of_data'] = df_crime_prop['count_cmplnt']/totalRecords
alt.Chart(df_crime_prop).mark_bar().encode(
x = alt.X('OFNS_DESC:N', sort='-y'),
y='percent_of_data:Q',
tooltip = ['OFNS_DESC', 'percent_of_data']
).properties(width = 800, height = 300, title="Proportion of Crimes in the Dataset")
As the original data is mostly timestamp and categorical data, with the exception of geopositional fields such as longitude and latitude, we have aggregated the data by day to analyze trends in daily crime volume, focusing on the top 6 crimes by volume: Petit Larceny, Harrassment 2, Assault 3 & Related Offenses, Criminal Mischief & Related Offenses, Grand Larceny, and Dangerous Drugs.
Below, we've constructed boxplots to analyze the volume of daily crimes by New York City borough. We can see that Manhattan has the highest median daily crime volume at 18.5 crimes per day, whereas Staten Island has the lowest at 1 crime per day. The boxplot allows us to compare the distribution of crimes per day by borough; each borough has a right-skewed distribution.
#take random samples
size = 100 # sample size
replace = True # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df_smpl = df_Crime.groupby('BORO_NM', as_index=False).apply(fn).reset_index()
alt.Chart(df_smpl).mark_boxplot().encode(
x='BORO_NM:N',
y='count_cmplnt:Q'
).properties(width = 500, height = 300,title="Boxplot of Number of Crimes Per Day by Borough")
By analyzing the volume of daily crimes by season, we can see that Spring and Summer have the highest median crime rates at 8 crimes per day each. Fall has a median daily crime rate of 5 and Winter has a median daily crime rate of 4 crimes per day. The boxplot was chosen to compare the distribution of crimes per day by season.
#take random samples
size = 100 # sample size
replace = True # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df_smpl = df_Crime.groupby('Season', as_index=False).apply(fn).reset_index()
alt.Chart(df_smpl).mark_boxplot().encode(
x='Season:N',
y='count_cmplnt:Q'
).properties(width = 500, height = 300,title="Boxplot of Number of Crimes Per Day by Season")
Analyzing the daily crime rate by Month reveals that June has the highest median daily crime rate at 12.5 crimes per day. December has the lowest median daily crime rate at 5 crimes per day. The boxplot was chosen to compare the distribution of crimes per day by month.
#take random samples
size = 100 # sample size
replace = True # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df_smpl = df_Crime.groupby('Month', as_index=False).apply(fn).reset_index()
mySort = ["January","February","March","April","May","June","July","August","September","October","November","December"]
alt.Chart(df_smpl).mark_boxplot().encode(
x = alt.X('Month:N', sort=mySort),
y='count_cmplnt:Q'
).properties(width = 500, height = 300,title="Boxplot of Number of Crimes Per Day by Month")
We used the start date/time and end date/time to calculate the duration of each crime. Using this new "Duration" feature, we've created boxplots to analyze patterns between duration and different categorical variables.
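A sketch of that duration calculation on hypothetical rows; we assume the date and time columns are strings in the raw format and express Duration in hours:

```python
import pandas as pd

# Two hypothetical complaints, the second spanning midnight
demo = pd.DataFrame({
    "CMPLNT_FR_DT": ["01/05/2015", "01/05/2015"],
    "CMPLNT_FR_TM": ["10:30:00", "23:00:00"],
    "CMPLNT_TO_DT": ["01/05/2015", "01/06/2015"],
    "CMPLNT_TO_TM": ["12:30:00", "01:00:00"],
})

# Combine the date and time strings into timestamps, then subtract
start = pd.to_datetime(demo["CMPLNT_FR_DT"] + " " + demo["CMPLNT_FR_TM"], format="%m/%d/%Y %H:%M:%S")
end = pd.to_datetime(demo["CMPLNT_TO_DT"] + " " + demo["CMPLNT_TO_TM"], format="%m/%d/%Y %H:%M:%S")
demo["Duration"] = (end - start).dt.total_seconds() / 3600  # hours
```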
Below, we can see that felonies tend to have longer durations than misdemeanors and violations.
#Duration box plot by crime type
ax = sns.boxplot(x="LAW_CAT_CD", y="Duration", data=df)
ax.set_title("Boxplot of Duration of Crime by Crime Type")
The following boxplot assesses Duration against time of day; it indicates that Morning has a larger spread of durations than other times of day.
#Duration by time of day
ax = sns.boxplot(x="Daytime", y="Duration", data=df)
ax.set_title("Boxplot of Duration of Crime by Daytime")
Looking at Duration by borough doesn't reveal large differences in duration between boroughs.
#Duration by Borough
ax = sns.boxplot(x="BORO_NM", y="Duration", data=df)
ax.set_title("Boxplot of Duration of Crime by Borough")
Looking at duration by offense description reveals that certain crimes have a larger median duration. For instance, Grand Larceny of Motor Vehicle has a median duration of 12 hours compared to other crimes, which have a median duration of 1-2 hours. The boxplot gives a good idea of the distribution of the duration of the crime by offense description, helping us understand whether more serious or complex crimes have higher IQRs or longer overall durations. This information could be helpful in classifying offenses.
#take random samples by OFNS_DESC for box plot
size = 25 # sample size
replace = True # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df_smpl = df.groupby('OFNS_DESC', as_index=False).apply(fn).reset_index()
alt.Chart(df_smpl).mark_boxplot().encode(
x='OFNS_DESC:N',
y='Duration:Q'
).properties(title="Boxplot of Duration for each Offense Description")
Below is a plot of crimes per day in New York with a LOWESS curve. The purpose of the LOWESS curve is to show the trend of crimes: It looks like crime is higher in the summer months than the winter months, and that tends to be a yearly cycle. We chose to create a scatter plot with a curve over it since there were 760 points of highly noisy data. Another plot would have likely been too noisy, and the curve highlights the overall trend that may be hard to see in the noisy daily crimes data.
df_Crime['CMPLNT_FR_DT'] = pd.to_datetime(df_Crime['CMPLNT_FR_DT'], format = '%m/%d/%Y')
crimeByDate = df_Crime.groupby('CMPLNT_FR_DT')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.xlabel('Date of Crime')
plt.ylabel('Number of Crimes')
plt.grid(True)
plt.ylim(-5, 1500)
plt.title('Number of Crimes per Date in NYC with LOWESS Curve')
plt.plot_date(crimeByDate['CMPLNT_FR_DT'], crimeByDate['count_cmplnt'])
#lowess requires numeric x values; plot the fitted values (second column of the result)
plt.plot_date(crimeByDate['CMPLNT_FR_DT'], lowess(crimeByDate['count_cmplnt'], mdates.date2num(crimeByDate['CMPLNT_FR_DT']), frac=.25)[:, 1], '-')
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
ax.xaxis.set_major_formatter(DateFormatter('%m/%Y'))
plt.show()
#Code source: https://medium.com/@madhuramiah/geographic-plotting-with-python-folium-2f235cc167b7
import pandas as pd
import folium
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import json
from folium import plugins
%matplotlib inline
First, we have to define the count bins by offense. In this step we included only locations with n > 25, but the threshold can be lowered, as the bins are defined down to counts of five. Colors and bin sizes can also be modified.
location=df.groupby(['Latitude','Longitude','OFNS_DESC']).size().reset_index(name='count').sort_values(by='count',ascending=False)
location=location[location['count']>25]
location['color']=location['count'].apply(lambda count:"Black" if count>=400 else
"green" if count>=200 and count<400 else
"Orange" if count>=100 and count<200 else
"darkblue" if count>=50 and count<100 else
"red" if count>=25 and count<50 else
"lightblue" if count>=10 and count<25 else
"brown" if count>=5 and count<10 else
"grey")
location['size']=location['count'].apply(lambda count:12 if count>=400 else
10 if count>=200 and count<400 else
8 if count>=100 and count<200 else
6 if count>=50 and count<100 else
4 if count>=25 and count<50 else
2 if count>=10 and count<25 else
1 if count>=5 and count<10 else
0.1)
location
The heatmap demonstrates the concentration of cases. When removing the case-count filters, the finding is that crimes happen everywhere. We narrowed focus to counts greater than 25 to assess high crime volume. Manhattan across the board has high crime rates, as does Rikers Island (home of the city's main jail complex) and Bay Plaza (a shopping complex).
The first plot below has the various crimes color coded and the locations they occurred.
#we have to define the center of the map first, to base the map from
m=folium.Map([40.738, -73.98],zoom_start=11)
#location=location[0:2000]
for lat,lon,area,color,count,size in zip(location['Latitude'],location['Longitude'],location['OFNS_DESC'],location['color'],
location['count'],location['size']):
folium.CircleMarker([lat, lon],
popup=area,
radius=size,
color='b',
fill=True,
fill_opacity=0.7,
fill_color=color,
).add_to(m)
#Map is saved to file
m.save('Crime_Map.html')
m
In the following plot, we add a heatmap element to the map to show density.
location1=location[0:1326]
location_data = location[['Latitude', 'Longitude']]
# plot heatmap
m.add_child(plugins.HeatMap(location_data, radius=15))
m.save('heatmap.html')
m
Here we look at potential interactions between the attributes that we will use as predictors. The first 5 lines of our df_Crime dataframe are below.
df_Crime['CMPLNT_FR_DT'] = pd.to_datetime(df_Crime['CMPLNT_FR_DT'], format = '%m/%d/%Y')
df_Crime['dowNo'] = df_Crime['CMPLNT_FR_DT'].dt.dayofweek
df_Crime['MonthNo'] = df_Crime['CMPLNT_FR_DT'].dt.month
df_Crime.head()
Below, we have created a correlation matrix for each of the variables in the dataset.
#Corrplots intuition from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
plot_corr = df_Crime.corr()
#Create Mask for upper triangle
mask = np.triu(np.ones_like(plot_corr, dtype=bool))
#Create heatmap
fig, ax = plt.subplots(figsize=(15, 12))
heatmap = sns.heatmap(plot_corr,
mask = mask,
square = True,
linewidths = .5,
cmap = 'coolwarm',
cbar_kws = {'shrink': .6,
'ticks' : [-1, -.5, 0, 0.5, 1]},
vmin = -1,
vmax = 1,
annot = True,
annot_kws = {'size': 8})
#add the column names as labels
ax.set_yticklabels(plot_corr.columns, rotation = 0)
ax.set_xticklabels(plot_corr.columns)
ax.set_title('Correlation Matrix of Variables')
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})
Below, we've created a dataframe grouped by day and borough. Since we are taking the maximum value per day per borough, attributes that don't change within a day, such as weather, information about the day, and information about the borough, remain constant. What does change are GeoCell and count_cmplnt. Since count_cmplnt is the target variable, we will not examine it in this interaction analysis. Aside from borough, GeoCell doesn't appear to interact with any other variable. We will look at borough and GeoCell at the end of this analysis.
See the first 5 rows of the grouped dataframe below.
gpByDay = df_Crime.groupby(['CMPLNT_FR_DT', 'BORO_NM']).max()
gpByDay = gpByDay.drop(['GeoCell', 'count_cmplnt'], axis = 1)
gpByDay.head()
Below is a matrix of the correlations in the numeric data. A few variables have notably high correlations: PRCP and SNOW, MonthNo and SNOW, Temperature and MonthNo, and TMIN and TMAX. The correlations seem to be concentrated on the weather variables and the time of the year.
#Group by Correlation
gpByDay = gpByDay.reset_index()
plot_corr = gpByDay.corr()
#Create Mask for upper triangle
mask = np.triu(np.ones_like(plot_corr, dtype=bool))
#Create heatmap
fig, ax = plt.subplots(figsize=(15, 12))
heatmap = sns.heatmap(plot_corr,
mask = mask,
square = True,
linewidths = .5,
cmap = 'coolwarm',
cbar_kws = {'shrink': .6,
'ticks' : [-1, -.5, 0, 0.5, 1]},
vmin = -1,
vmax = 1,
annot = True,
annot_kws = {'size': 8})
#add the column names as labels
ax.set_yticklabels(plot_corr.columns, rotation = 0)
ax.set_xticklabels(plot_corr.columns)
ax.set_title('Correlation Matrix of Variables')
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})
Here, we look at a pairs plot of the variables in the grouped dataframe. This plot was created to look at whether there are some interesting non-linear trends. There isn't a lot to see as far as interactions here, but the few that do seem to have interactions will be looked at in more detail.
axes = pd.plotting.scatter_matrix(gpByDay, figsize = (25,25));
The weather variables seem to have interactions. We start with TMIN and TMAX, which are the minimum and maximum daily temperatures, respectively. They are heavily correlated with each other, and it makes sense to only use one or the other to predict our target variable. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
fig, ax = plt.subplots()
plt.xlabel('Minimum Daily Temperature')
plt.ylabel('Maximum Daily Temperature')
plt.title('TMIN vs. TMAX')
plt.scatter(gpByDay['TMIN'], gpByDay['TMAX']);
Removing days without snow, there seems to be a correlation between the PRCP variable and the SNOW variable. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
fig, ax = plt.subplots()
plt.xlabel('Precipitation in Inches')
plt.ylabel('Snow in Inches')
plt.title('Daily Precipitation vs. Snow')
raindf = gpByDay[gpByDay['SNOW'] > 0]
plt.scatter(raindf['PRCP'], raindf['SNOW']);
There doesn't seem to be any meaningful correlation between daily max temperature and precipitation. Note that there seem to be very few days with much precipitation when the max temperature is below 25 degrees. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
fig, ax = plt.subplots()
plt.xlabel('Max Temperature')
plt.ylabel('Precipitation in Inches')
plt.title('Daily Max Temperature vs. Precipitation')
raindf = gpByDay[gpByDay['PRCP'] > 0]
plt.scatter(raindf['TMAX'], raindf['PRCP']);
There is an interaction between daily maximum temperature and the month, which is expected. Winter months should have a lower temperature than summer months. A bar plot was chosen since we are looking at the average of a numeric variable over levels of a categorical variable.
fig, ax = plt.subplots()
plt.xlabel('Month')
plt.ylabel('Average Max Temperature')
plt.title('Average Temperature per Month over 2014 - 2015')
plt.xticks(np.arange(1,13),['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']);
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
tempByMo = gpByDay.groupby('MonthNo').mean().reset_index()
plt.bar(tempByMo['MonthNo'], tempByMo['TMAX']);
There does seem to be an interaction between precipitation and the month. A bar plot was chosen since we are looking at the average of a numeric variable over levels of a categorical variable.
fig, ax = plt.subplots()
plt.xlabel('Month')
plt.ylabel('Average Daily Precipitation in Inches')
plt.title('Average Daily Precipitation per Month over 2014 - 2015')
plt.xticks(np.arange(1,13),['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']);
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
plt.bar(tempByMo['MonthNo'], tempByMo['PRCP']);
There is an interaction with daily snow and month. The snowiest month in New York seems to be February, while April through December get little to no snow. A bar plot was chosen since we are looking at the average of a numeric variable over levels of a categorical variable.
fig, ax = plt.subplots()
plt.xlabel('Month')
plt.ylabel('Average Daily Snow in Inches')
plt.title('Average Daily Snow per Month over 2014 - 2015')
plt.xticks(np.arange(1,13),['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']);
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
plt.bar(tempByMo['MonthNo'], tempByMo['SNOW']);
Here is some demographic data for the different boroughs of New York. It seems that Manhattan has a significantly higher average income than the other 4 boroughs, and has the highest home sales price range. The largest boroughs by population are Brooklyn and Queens. Staten Island has by far the lowest population. A table was chosen since those 3 variables are constant for each borough across every date in the data.
gpByDay[['BORO_NM', 'Population', 'PC_INCOME', 'Hm_Sls_Price_Range']].head()
Below is a table of interactions between holidays and events in New York. There are 615 days with no holidays or events, 97 days with events but no holidays, 9 days with holidays and no events, and 9 days with both. A two-way count table was chosen, as it is the most direct way to see exactly how many days fall into each of the four holiday/event combinations.
eventdf = gpByDay.groupby('CMPLNT_FR_DT').max().reset_index()[['is_Holiday', 'is_Event']]
#Two-way count of days by holiday/event status
pd.crosstab(eventdf['is_Holiday'], eventdf['is_Event'])
The first five lines of our grouped dataframe are below. We are trying to predict the count_cmplnt variable, which is the number of crimes per date, time, and geographic location. We have decided to explore how each variable in this dataframe impacts count_cmplnt.
df_Crime.head()
Looking at the count_cmplnt column specifically, it looks like the PC_INCOME and GeoCell variables have the greatest correlation with it. The other variables don't have a large correlation with count_cmplnt, but we will continue our analysis to see how they may affect it, as correlation isn't everything.
#Corrplots
plot_corr = df_Crime.corr()
#Create Mask for upper triangle
mask = np.triu(np.ones_like(plot_corr, dtype=bool))
#Create heatmap
fig, ax = plt.subplots(figsize=(15, 12))
heatmap = sns.heatmap(plot_corr,
mask = mask,
square = True,
linewidths = .5,
cmap = 'coolwarm',
cbar_kws = {'shrink': .6,
'ticks' : [-1, -.5, 0, 0.5, 1]},
vmin = -1,
vmax = 1,
annot = True,
annot_kws = {'size': 8})
#reorder columns
#add the column names as labels
ax.set_yticklabels(plot_corr.columns, rotation = 0)
ax.set_xticklabels(plot_corr.columns)
ax.set_title('Correlation Matrix of Variables')
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})
Below is a graph of the total crimes per day of the week over 2014 - 2015. The most crimes occur on Friday, and the fewest crimes occur on Sunday. There isn't a large difference between the individual days of the week, but there may be enough difference between Friday and Sunday to have an impact on our models. A bar plot was chosen since we were interested in looking at the total number of crimes across levels of a categorical variable.
dowdf = df_Crime.groupby('dowNo')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.xlabel('Day of Week')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per Day of Week over 2014 - 2015')
plt.bar(dowdf['dowNo'], dowdf['count_cmplnt']);
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
plt.xticks(np.arange(7),['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']);
The month seems to have a large impact on the crime count. February has the lowest crime count of all months. This is not normalized for the number of days in the month, but winter months, regardless of length, typically have less crime than summer months. The month with the most crime is July, with 55,610 crimes committed over the 2 years, though August isn't far behind at 55,251. A bar plot was chosen since we were interested in looking at the total number of crimes across levels of a categorical variable.
modf = df_Crime.groupby('MonthNo')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.bar(modf['MonthNo'], modf['count_cmplnt'])
plt.xlabel('Month')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per Month over 2014 - 2015')
plt.xticks(np.arange(1,13),['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']);
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
plt.show()
There seems to be a much smaller number of crimes in Staten Island than in the other four New York boroughs. Brooklyn had the most total crimes between 2014 - 2015. Crime is more likely to happen in a high population area, so we decided to normalize these numbers by population. A bar plot was chosen since we were interested in looking at the total number of crimes across levels of a categorical variable.
borodf = df_Crime.groupby('BORO_NM')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.xlabel('Borough')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per Borough over 2014 - 2015')
plt.bar(borodf['BORO_NM'], borodf['count_cmplnt']);
When we normalize the number of crimes by population, Queens has the lowest number of crimes and Manhattan has the highest number of crimes. Staten Island had a low total number of crimes, but a medium number of crimes per population. A bar plot was chosen since we were interested in looking at the number of crimes per person across levels of a categorical variable.
popdf = df_Crime.groupby('BORO_NM')['Population'].max()
boropopdf = borodf.join(popdf, on = 'BORO_NM')
#boropopdf['Population'] = boropopdf['Population'].str.replace(',','').astype(int)
boropopdf['Cmplnt_Pop'] = boropopdf['count_cmplnt'] / boropopdf['Population']
fig, ax = plt.subplots()
plt.xlabel('Borough')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per Borough Population over 2014 - 2015')
plt.bar(boropopdf['BORO_NM'], boropopdf['Cmplnt_Pop']);
In order to bring the latitude and longitude into the grouped analysis, we created 100 GeoCells to bin the locations of crimes. When we plot the number of crimes by location, we find a large difference between the 100 geocells. GeoCell number 87 has a two year crime count of 48,434, while many of these GeoCells have no crime at all. Crime count has a large dependence on location. A bar plot was chosen since we were interested in looking at the total number of crimes across levels of a categorical variable.
locdf = df_Crime.groupby('GeoCell')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.xlabel('GeoCell')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per GeoCell over 2014 - 2015')
plt.bar(locdf['GeoCell'], locdf['count_cmplnt']);
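The notebook doesn't show how GeoCell was constructed; a minimal version, assuming an equal-width 10 x 10 grid over the city's bounding box, might look like this (the coordinates below are random points, not real complaints):

```python
import numpy as np
import pandas as pd

# Random points inside an approximate NYC bounding box (illustrative only)
rng = np.random.default_rng(0)
pts = pd.DataFrame({
    "Latitude": rng.uniform(40.5, 40.9, 500),
    "Longitude": rng.uniform(-74.25, -73.7, 500),
})

# Bin each axis into 10 equal-width intervals and combine into 100 cells
lat_bin = pd.cut(pts["Latitude"], bins=10, labels=False)
lon_bin = pd.cut(pts["Longitude"], bins=10, labels=False)
pts["GeoCell"] = lat_bin * 10 + lon_bin  # cell ids 0..99
```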
Average home sale price also seems to have a large impact on the number of crimes committed. We binned average home sale price into low, medium, high, and very high. The high and low sale price areas seem to have the most crime, while the medium areas seem to have the lowest crime count. A bar plot was chosen since we were interested in looking at the total number of crimes across levels of a categorical variable.
moneydf = df_Crime.groupby('Hm_Sls_Price_Range')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.xlabel('Home Sale Price Range')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per Home Sale Price Range over 2014 - 2015')
plt.bar(moneydf['Hm_Sls_Price_Range'], moneydf['count_cmplnt']);
When we look at the number of crimes per day vs. the daily high temperature, we can see that crime seems to increase linearly as temperature increases. This seems to confirm what we saw in the plot of crimes per month: the colder winter months have lower amounts of crime, while the warmer summer months have larger amounts of crime. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
#Aggregate daily crime counts and join the daily weather observations
datedf = df_Crime.groupby('CMPLNT_FR_DT')['count_cmplnt'].sum().reset_index()
temp_df = df_Crime.groupby('CMPLNT_FR_DT')[['TMAX', 'TMIN', 'SNOW', 'PRCP']].max()
tempdf = datedf.join(temp_df, on = 'CMPLNT_FR_DT')
fig, ax = plt.subplots()
plt.scatter(tempdf['TMAX'], tempdf['count_cmplnt']);
plt.xlabel('High Temperature')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per High Temperature over 2014 - 2015');
The daily low temperature is highly correlated with the daily high temperature, so the plot of low temperature vs. number of crimes is nearly identical to the high temperature vs. number of crimes. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
fig, ax = plt.subplots()
plt.scatter(tempdf['TMIN'], tempdf['count_cmplnt'])
plt.xlabel('Low Temperature')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes per Low Temperature over 2014 - 2015');
Here, we look at the amount of snow vs. the crime count. It does not snow very often in New York, so it is hard to tell if there is a trend with so few data points, but the number of crimes does seem to have a negative linear dependence on the amount of snow that New York receives. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
snowdf = tempdf[tempdf['SNOW'] > 0]
fig, ax = plt.subplots()
plt.scatter(snowdf['SNOW'], snowdf['count_cmplnt'])
plt.xlabel('Snow in Inches')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes and Snow over 2014 - 2015');
Here, we look at the amount of daily rain that New York received vs. the number of crimes that day. After removing days with no rain, there does not appear to be much of a trend; if anything, there is a slight negative one. A scatter plot was chosen since we are looking at two numeric variables and a line plot would have been too noisy.
prcpdf = tempdf[tempdf['PRCP'] > 0]
fig, ax = plt.subplots()
plt.scatter(prcpdf['PRCP'], prcpdf['count_cmplnt'])
plt.xlabel('Precipitation in Inches')
plt.ylabel('Number of Crimes')
plt.title('Total Crimes and Precipitation over 2014 - 2015');
The correlation matrix of daily crimes committed against temperature, snow, and precipitation confirms what we saw in the plots: crime counts are strongly positively correlated with temperature, slightly negatively correlated with the amount of snow, and very slightly negatively correlated with rain.
#Corrplots
plot_corr = tempdf.corr()
#Create Mask for upper triangle
mask = np.triu(np.ones_like(plot_corr, dtype=bool))
#Create heatmap
fig, ax = plt.subplots(figsize=(10, 8))
heatmap = sns.heatmap(plot_corr,
mask = mask,
square = True,
linewidths = .5,
cmap = 'coolwarm',
cbar_kws = {'shrink': .5,
'ticks' : [-1, -.5, 0, 0.5, 1]},
vmin = -1,
vmax = 1,
annot = True,
annot_kws = {'size': 15})
#Add the column names as labels
ax.set_yticklabels(plot_corr.columns, rotation = 0)
ax.set_xticklabels(plot_corr.columns)
ax.set_title('Correlation Matrix of Crimes Committed vs Weather Conditions')
sns.set_style({'xtick.bottom': True, 'ytick.left': True})
Here, we look at the time of day vs. the crime count. Crimes occur most often during the Day, Evening and Night periods, while Early Morning, Late Night and Morning see far fewer. A bar plot was chosen since we were interested in looking at the total number of crimes across levels of a categorical variable.
dtdf = df_Crime.groupby('Daytime')['count_cmplnt'].sum().reset_index()
fig, ax = plt.subplots()
plt.bar(dtdf['Daytime'], dtdf['count_cmplnt'])
plt.xlabel('Time of Day')
plt.ylabel('Crime Count')
ax.xaxis.set_tick_params(rotation=45, labelsize=10)
plt.title('Total Crimes per Time of Day over 2014 - 2015');
As mentioned previously, our original dataset was somewhat limited: most of the features are categorical variables and timestamps, with some geospatial data such as longitude and latitude as well as x and y coordinates. Therefore, we have identified external datasets and developed new features from the original attributes to enhance our data.
Several new features have been created from the original data set by:
New external data has been joined with our original data set to give us:
Using the new Daytime and Day_Name features, we can visualize when crimes occur most often throughout the week. Below, we can see that crimes most often occur in the evening and night, and from Tuesday through Friday. These new variables appear to be very useful for finding patterns in crime behavior.
df_datime_wkday = df.groupby(['Day_Name','Daytime']).agg(count_cmplnt=('CMPLNT_NUM','count')).reset_index()
daySort = ["Monday","Tuesday","Wednesday","Thursday", "Friday", "Saturday", "Sunday"]
timeSort = ["Early Morning", "Morning", "Day", "Evening", "Night", "Late Night"]
alt.Chart(df_datime_wkday).mark_rect().encode(
    x = alt.X('Day_Name:N', sort=daySort),
    y = alt.Y('Daytime:N', sort=timeSort),
    color = 'count_cmplnt:Q',
    tooltip = ['Day_Name', 'count_cmplnt', 'Daytime']
).properties(title="Crime volumes by Time of Day and Day of Week")
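A minimal sketch of how `Day_Name` and `Daytime` can be derived from the complaint date and time stamps; the six Daytime labels match those used above, but the specific hour boundaries and sample timestamp format here are illustrative assumptions.

```python
import pandas as pd

#Sketch: derive day-of-week and time-of-day bucket from the raw stamps
df = pd.DataFrame({
    'CMPLNT_FR_DT': ['01/03/2014', '07/15/2015'],
    'CMPLNT_FR_TM': ['02:30:00', '19:45:00'],
})
dt = pd.to_datetime(df['CMPLNT_FR_DT'], format='%m/%d/%Y')
hour = pd.to_datetime(df['CMPLNT_FR_TM'], format='%H:%M:%S').dt.hour
df['Day_Name'] = dt.dt.day_name()
#Assumed 4-hour buckets: 0-3 Late Night, 4-7 Early Morning, ..., 20-23 Night
df['Daytime'] = pd.cut(hour,
                       bins=[-1, 3, 7, 11, 15, 19, 23],
                       labels=['Late Night', 'Early Morning', 'Morning',
                               'Day', 'Evening', 'Night'])
```

Binning the hour into labeled buckets turns a noisy 24-level variable into six interpretable categories for the heatmap above.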
Using the new "GeoCell" feature, we can visualize where on the map we have higher densities of crime without being constrained by borough or neighborhood boundaries. Each x/y coordinate pair maps to a cell. The GeoCell feature was created to give a generalized meaning to location for each record, similar to cutting a continuous variable into a new categorical feature. This may prove beneficial for modeling, since using exact x/y coordinates may lead to overfitting.
df_XY_ofns = df.groupby(['GeoCell_X', 'GeoCell_Y']).agg(count_cmplnt=('CMPLNT_NUM','count')).reset_index()
myXSort = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
myYSort = [10,9,8,7,6,5,4,3,2,1]
alt.Chart(df_XY_ofns).mark_rect().encode(
x = alt.X('GeoCell_X:O', sort=myXSort),
y=alt.Y('GeoCell_Y:O', sort=myYSort),
color='count_cmplnt:Q',
tooltip = ['GeoCell_X', 'GeoCell_Y','count_cmplnt']
).properties(title="Crime volumes by GeoCell")
Dividing up the original 68 different types of crime offenses into 4 distinct buckets allows several avenues to explore classification-based modeling for identifying types of crimes being committed in certain geocells/boroughs.
The below code allocates the various types of crimes into the 4 distinct buckets: Personal, Property, Statutory, and Other
#Replace all hyphens and forward slashes with spaces
def string_replace(string):
    return string.replace('-', ' ').replace('/', ' ')
#Create function to apply new bin labels for crimes types
def crime_apply(string):
    #Check every word before falling back to "Other", so that descriptions
    #like "GRAND LARCENY" are not misclassified on their first word
    for word in string.split():
        if word in personal_list:
            return "Personal Crime"
        elif word in property_list:
            return "Property Crime"
        elif word in statutory_list:
            return "Statutory Crime"
    return "Other"
#Create categorical list of all 68 possible offenses
#4 types of crime: Personal, Property, Statutory/State, and Other
#Each bucket was derived using intel from https://www.justia.com/criminal/offenses/
#Personal - Assault, Battery, Abuse, Kidnapping, Rape, Homicide, Harassment
personal_list = ['ASSAULT', 'BATTERY', 'MURDER', 'MANSLAUGHTER', 'HARRASSMENT', 'KIDNAPPING', 'RAPE', 'SEX', 'HOMICIDE', 'WEAPON', 'WEAPONS', 'PERSON']
#Property - Arson, Theft, Mischief, Burglary, Larceny, Robbery, Forgery, Embezzlement, Shoplifting, Trespass, Fraud/Financial
property_list = ['ARSON', 'THEFT', 'THEF', 'MISCHIEF', 'ROBBERY', 'BURGLARY', 'JOSTLING', 'TOOLS', 'FRAUD', 'FRAUDS', 'FRAUDULENT', 'LARCENY', 'FORGERY', 'STOLEN', 'UNAUTHORIZED', 'TRESPASS']
#Statutory - State-Related offenses, DUI, DWI, Gambling, Loitering, Drug, Traffic, Public Servant/Admin-related crimes
statutory_list = ['DRUGS', 'DRUG', 'SENSBLTY', 'PENAL', 'PUBLIC', 'INTOXICATED', 'TRAFFIC', 'ADMINISTRATIVE', 'GAMBLING', 'PROSTITUTION', 'ALCOHOLIC', 'DISORDERLY']
#Other - everything else
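As a self-contained illustration of the bucketing logic, the snippet below runs a few sample offense descriptions through the two steps; the abbreviated keyword lists here are stand-ins for the full lists above.

```python
#Demo of keyword-based crime bucketing with abbreviated lists
personal_list = ['ASSAULT', 'RAPE', 'KIDNAPPING']
property_list = ['LARCENY', 'ROBBERY', 'MISCHIEF']
statutory_list = ['DRUGS', 'GAMBLING', 'PROSTITUTION']

def string_replace(string):
    #Replace hyphens and forward slashes with spaces
    return string.replace('-', ' ').replace('/', ' ')

def crime_apply(string):
    #Check every word before falling back to "Other"
    for word in string.split():
        if word in personal_list:
            return 'Personal Crime'
        elif word in property_list:
            return 'Property Crime'
        elif word in statutory_list:
            return 'Statutory Crime'
    return 'Other'

for desc in ['GRAND LARCENY', 'CRIMINAL MISCHIEF/VANDALISM', 'LOITERING']:
    print(desc, '->', crime_apply(string_replace(desc)))
```

Note that "CRIMINAL MISCHIEF/VANDALISM" only matches because the slash is replaced with a space first, which is why `string_replace` must run before `crime_apply`.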
The below code outputs the count of crimes in the dataset by the 4 new buckets of crime types. Property crime has the largest share of observations with 525,885 crimes, while "Other" has the smallest.
#Display crime counts with the new feature
#.copy() avoids SettingWithCopyWarning when assigning new columns below
df1 = df_old[df_old['OFNS_DESC'].notna()].copy()
df1['OFNS_DESC'] = df1['OFNS_DESC'].apply(string_replace)
df1['OFNS_DESC_TYPE'] = df1['OFNS_DESC'].apply(crime_apply)
df1['OFNS_DESC_TYPE'].value_counts()